This notebook is intended for educational purposes only: a supervised classifier will be developed with the aim of maximizing its metrics. The metrics used in this case will be accuracy, AUC, and F1.
The dataset was provided by Kaggle user Kamal Das; it is a public dataset intended as a beginner's dataset for financial analytics. The description is as follows:
This is a synthetic dataset created using actual data from a financial institution. The data has been modified to remove identifiable features and the numbers transformed to ensure they do not link to original source (financial institution).
This is intended to be used for academic purposes for beginners who want to practice financial analytics from a simple financial dataset.
The variables in the Dataset are described as follows:
import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE
#Load CSV with pandas
df = pd.read_csv("Default_Fin.csv", index_col = 0)
#Let's see a sample of the data
df.sample(5)
| Employed | Bank Balance | Annual Salary | Defaulted? | |
|---|---|---|---|---|
| Index | ||||
| 1607 | 1 | 0.00 | 398016.72 | 0 |
| 6655 | 1 | 11184.72 | 546063.72 | 0 |
| 3408 | 1 | 5577.12 | 214346.76 | 0 |
| 9632 | 0 | 3903.36 | 191552.52 | 0 |
| 7930 | 0 | 10597.92 | 218556.96 | 0 |
#Basic Dataset statistic information
df.describe()
| Employed | Bank Balance | Annual Salary | Defaulted? | |
|---|---|---|---|---|
| count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 0.705600 | 10024.498524 | 402203.782224 | 0.033300 |
| std | 0.455795 | 5804.579486 | 160039.674988 | 0.179428 |
| min | 0.000000 | 0.000000 | 9263.640000 | 0.000000 |
| 25% | 0.000000 | 5780.790000 | 256085.520000 | 0.000000 |
| 50% | 1.000000 | 9883.620000 | 414631.740000 | 0.000000 |
| 75% | 1.000000 | 13995.660000 | 525692.760000 | 0.000000 |
| max | 1.000000 | 31851.840000 | 882650.760000 | 1.000000 |
By column, we can affirm that:
#Null / Empty Values
df.info()
df.shape
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1 to 10000
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Employed       10000 non-null  int64
 1   Bank Balance   10000 non-null  float64
 2   Annual Salary  10000 non-null  float64
 3   Defaulted?     10000 non-null  int64
dtypes: float64(2), int64(2)
memory usage: 390.6 KB
(10000, 4)
With the information shown above, it is possible to confirm that the dataset is complete and there are no missing values, since the non-null count of every column equals the total number of rows.
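This completeness check can also be done programmatically; a minimal sketch on a toy frame with the same columns (the real check would use the `df` loaded above):

```python
import pandas as pd

# Toy frame mirroring the dataset's columns (values are illustrative)
toy = pd.DataFrame({
    "Employed": [1, 0, 1],
    "Bank Balance": [0.00, 11184.72, 5577.12],
    "Annual Salary": [398016.72, 546063.72, 214346.76],
    "Defaulted?": [0, 0, 0],
})

missing = toy.isnull().sum()   # per-column count of missing values
print(missing.sum())           # 0 -> the frame is complete
```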
#Box plots of Data
fig = make_subplots(rows=2, cols=2,
subplot_titles=df.columns)
fig.add_trace(
go.Box(x=df['Employed']),
row=1, col=1)
fig.add_trace(
go.Box(x=df['Bank Balance']),
row=1, col=2)
fig.add_trace(
go.Box(x=df['Annual Salary']),
row=2, col=1)
fig.add_trace(
go.Box(x=df['Defaulted?']),
row=2, col=2)
These boxplots give us a head start on how our data is distributed, confirming that the binary columns are in fact binary. Bank Balance stands out from the others: with more observations in the fourth quartile and values spread across a wider range, it may be the feature most useful for our prediction.
#PLot KDEs
fig, axes = plt.subplots(1, 3, figsize=(15,5))
sns.set()
sns.histplot(data=df, x="Employed", kde = True, hue = "Defaulted?", ax=axes[0])
sns.histplot(data=df, x="Bank Balance", kde = True, hue = "Defaulted?", ax=axes[1])
sns.histplot(data=df, x="Annual Salary", kde = True, hue = "Defaulted?", ax=axes[2])
<AxesSubplot:xlabel='Annual Salary', ylabel='Count'>
From the KDEs shown above we can infer the following:
Employed
Bank Balance
Annual Salary
#Plot pairwise relationships
sns.pairplot(data=df, hue = "Defaulted?", corner=True, plot_kws=dict(marker="+", linewidth=0.5))
<seaborn.axisgrid.PairGrid at 0x18a67339a48>
Looking at the pairwise relationships in the dataset, Bank Balance looks like it is going to be the main factor for training our model, along with the Employed feature.
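One way to back this visual intuition with a number is a correlation check against the target. A sketch on illustrative data (on the real `df` this would simply be `df.corr()["Defaulted?"]`):

```python
import pandas as pd

# Illustrative frame where defaulters carry higher balances
toy = pd.DataFrame({
    "Employed":      [1, 1, 0, 0, 1, 0],
    "Bank Balance":  [2000.0, 5000.0, 14000.0, 20000.0, 3000.0, 18000.0],
    "Annual Salary": [300000.0, 450000.0, 250000.0, 280000.0, 500000.0, 220000.0],
    "Defaulted?":    [0, 0, 1, 1, 0, 1],
})

# Correlation of each feature with the target
corr = toy.corr()["Defaulted?"].drop("Defaulted?")
print(corr.sort_values(ascending=False))
```

In this toy frame Bank Balance has the strongest correlation with the target, mirroring what the pairplot suggests for the real data.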
Several scikit-learn modules are required, along with XGBoost; we will import them first.
#Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
#Metrics and pipeline
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score, f1_score
Now, we will prepare the data by splitting it into input & output.
#Train & Test
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
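A note not covered in the notebook: with only ~3.3% positives, an unstratified split can leave train and test with different default rates. Passing `stratify=y` (an optional improvement, not used above) keeps the class proportions equal across the splits. A small sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 33 + [0] * 967)   # ~3.3% positives, like the dataset

# stratify=y preserves the default rate in both train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

print(y_tr.mean(), y_te.mean())       # both ~0.033
```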
Next, dictionaries will be created in which the scaling methods and the models are initialized, along with DataFrames to hold the scores.
#Define Scaling Methods
scalers = {}
scalers['Standard'] = preprocessing.StandardScaler()
scalers['MinMax'] = preprocessing.MinMaxScaler()
scalers['MaxAbs'] = preprocessing.MaxAbsScaler()
scalers['RobustAbs'] = preprocessing.RobustScaler()
scalers['Yeo'] = preprocessing.PowerTransformer(method='yeo-johnson')
#Define Classifiers
classifiers = {}
classifiers['LogReg'] = LogisticRegression()
classifiers['SVM'] = LinearSVC(max_iter=10000)
classifiers['DecTree'] = DecisionTreeClassifier()
classifiers['RandFor'] = RandomForestClassifier()
classifiers['Bayes'] = GaussianNB()
classifiers['KNN'] = KNeighborsClassifier()
classifiers['MLP'] = MLPClassifier(max_iter=10000)
classifiers['XGB'] = xgb.XGBClassifier(objective="binary:logistic", use_label_encoder=False, eval_metric='logloss')
#Create DFs for Scores
scores = {}
scores['Accuracy'] = pd.DataFrame(columns=classifiers.keys(), index=scalers.keys())
scores['AUC'] = pd.DataFrame(columns=classifiers.keys(), index=scalers.keys())
scores['F1'] = pd.DataFrame(columns=classifiers.keys(), index=scalers.keys())
#Train the model and get scores in the Test Data
for model in classifiers.keys():
for scale in scalers.keys():
pipe = make_pipeline(scalers[scale], classifiers[model])
pipe.fit(X_train, y_train)
scores['Accuracy'].loc[scale,model] = pipe.score(X_test, y_test)
scores['AUC'].loc[scale,model] = roc_auc_score(y_test, pipe.predict(X_test))
scores['F1'].loc[scale,model] = f1_score(y_test, pipe.predict(X_test))
scores['Accuracy']
| LogReg | SVM | DecTree | RandFor | Bayes | KNN | MLP | XGB | |
|---|---|---|---|---|---|---|---|---|
| Standard | 0.971818 | 0.971818 | 0.954848 | 0.970606 | 0.969394 | 0.968788 | 0.971515 | 0.969697 |
| MinMax | 0.971818 | 0.971515 | 0.953636 | 0.970303 | 0.969394 | 0.968788 | 0.971515 | 0.969697 |
| MaxAbs | 0.971818 | 0.971515 | 0.954242 | 0.969394 | 0.969394 | 0.968788 | 0.971515 | 0.969697 |
| RobustAbs | 0.971515 | 0.971818 | 0.953939 | 0.970303 | 0.969394 | 0.969394 | 0.971515 | 0.969697 |
| Yeo | 0.971515 | 0.971818 | 0.954242 | 0.970909 | 0.967576 | 0.969091 | 0.971212 | 0.969697 |
At first glance, every model and scaling method looks very good. But these scores are for accuracy, which is biased by the imbalance between defaults and non-defaults; let's have a look at the other scores.
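The bias is easy to quantify: with 3.33% defaults, a classifier that always predicts "no default" already matches the accuracies in the table above. A quick check:

```python
import numpy as np

# 333 defaults out of 10,000, as in df.describe() (mean of Defaulted? = 0.0333)
y = np.zeros(10000, dtype=int)
y[:333] = 1

# Accuracy of the trivial all-zeros predictor
baseline_acc = (y == 0).mean()
print(baseline_acc)   # 0.9667 -- essentially the scores in the accuracy table
```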
scores['AUC']
| LogReg | SVM | DecTree | RandFor | Bayes | KNN | MLP | XGB | |
|---|---|---|---|---|---|---|---|---|
| Standard | 0.646714 | 0.601551 | 0.683108 | 0.664153 | 0.654494 | 0.649664 | 0.651074 | 0.672716 |
| MinMax | 0.597035 | 0.587845 | 0.677965 | 0.663996 | 0.654494 | 0.649664 | 0.633009 | 0.672716 |
| MaxAbs | 0.597035 | 0.587845 | 0.682795 | 0.663527 | 0.654494 | 0.649664 | 0.642041 | 0.672716 |
| RobustAbs | 0.642041 | 0.601551 | 0.678122 | 0.65948 | 0.654494 | 0.654494 | 0.637525 | 0.672716 |
| Yeo | 0.637525 | 0.588002 | 0.669246 | 0.664309 | 0.5 | 0.645305 | 0.646401 | 0.672716 |
scores['F1']
| LogReg | SVM | DecTree | RandFor | Bayes | KNN | MLP | XGB | |
|---|---|---|---|---|---|---|---|---|
| Standard | 0.407643 | 0.321168 | 0.360515 | 0.426036 | 0.402367 | 0.390533 | 0.4125 | 0.431818 |
| MinMax | 0.311111 | 0.287879 | 0.348936 | 0.423529 | 0.402367 | 0.390533 | 0.381579 | 0.431818 |
| MaxAbs | 0.311111 | 0.287879 | 0.357447 | 0.416185 | 0.402367 | 0.390533 | 0.397436 | 0.431818 |
| RobustAbs | 0.397436 | 0.321168 | 0.350427 | 0.416667 | 0.402367 | 0.402367 | 0.38961 | 0.431818 |
| Yeo | 0.38961 | 0.290076 | 0.340611 | 0.428571 | 0.0 | 0.385542 | 0.402516 | 0.431818 |
Looking at the F1 and AUC scores gives us a completely different view of our models; e.g., Bayes with Yeo scaling has an F1 score of 0.0 despite a ~96% accuracy. We will improve the models' performance by engineering the features of the dataset in the next section.
For this section, the existing features will be modified, scaled and/or transformed and also new features will be created, either by dummying features or binning them.
First, the Annual Salary and the Bank balance will be binned to group them and differentiate the groups who may have different KDEs. Secondly, a new variable will be created to represent a kind of leverage that will tell us the relation between the bank balance and the annual salary.
df_e = df.copy()
#Bin Bank Balance and Annual Salary
bins = {"Annual Salary":4, "Bank Balance":3}
df_e["Annual Salary Bin"] = pd.cut(df_e["Annual Salary"], bins["Annual Salary"], labels=False)
df_e["Bank Balance Bin"] = pd.cut(df_e["Bank Balance"], bins["Bank Balance"], labels=False)
#Create New Variable
df_e["Coverage"] = df_e["Bank Balance"] / df_e["Annual Salary"]
Next, a function will be created to scale the non-categorical values, i.e., Annual Salary, Bank Balance, and the newly created Coverage. This function will allow us to loop through all the scaling methods.
#Scale Data Function
def scale_data(method, df_e):
df_e["Salary Scaled"] = method.fit_transform(df_e[["Annual Salary"]])
df_e["Bank Scaled"] = method.fit_transform(df_e[["Bank Balance"]])
df_e["Coverage Scaled"] = method.fit_transform(df_e[["Coverage"]])
return df_e
As discussed above, the models by themselves are biased by the output imbalance, achieving good accuracy but awful recall. By using the Synthetic Minority Over-sampling Technique (SMOTE), the output categories will be more equally represented, probably decreasing accuracy but enhancing the other metrics.
# Balance dataset
smote = SMOTE(sampling_strategy='minority')
def resample_smote(df_e):
X_smote, y_smote = smote.fit_resample(df_e.drop(columns=["Defaulted?", "Annual Salary", "Bank Balance", "Coverage"]), df_e["Defaulted?"])
return X_smote, y_smote
Now, we will run through all the scaling methods and all the models.
#Run
for scale in scalers.keys():
method = scalers[scale] #define method
df_es = scale_data(method, df_e) #run scaling func
X, y = resample_smote(df_es)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33) #define train and test
for model in classifiers.keys():
classifiers[model].fit(X_train, y_train) #fit model
scores['Accuracy'].loc[scale,model] = classifiers[model].score(X_test, y_test) #get accuracy
scores['AUC'].loc[scale,model] = roc_auc_score(y_test, classifiers[model].predict(X_test)) #get AUC
scores['F1'].loc[scale,model] = f1_score(y_test, classifiers[model].predict(X_test)) #get F1
We can see a significant improvement in both AUC and F1 at the cost of a small decrease in accuracy; across all scaling methods, every model now performs far better.
scores['Accuracy']
| LogReg | SVM | DecTree | RandFor | Bayes | KNN | MLP | XGB | |
|---|---|---|---|---|---|---|---|---|
| Standard | 0.885598 | 0.886225 | 0.908948 | 0.925247 | 0.85974 | 0.918195 | 0.889829 | 0.919292 |
| MinMax | 0.882777 | 0.884344 | 0.906911 | 0.927284 | 0.854882 | 0.920389 | 0.877135 | 0.918195 |
| MaxAbs | 0.883717 | 0.883404 | 0.909732 | 0.928695 | 0.845792 | 0.920389 | 0.883874 | 0.919762 |
| RobustAbs | 0.880426 | 0.88168 | 0.906911 | 0.922583 | 0.851591 | 0.91412 | 0.879329 | 0.919135 |
| Yeo | 0.890613 | 0.893434 | 0.907068 | 0.931515 | 0.855195 | 0.920859 | 0.886695 | 0.921486 |
scores['F1']
| LogReg | SVM | DecTree | RandFor | Bayes | KNN | MLP | XGB | |
|---|---|---|---|---|---|---|---|---|
| Standard | 0.889394 | 0.8901 | 0.911689 | 0.927869 | 0.863213 | 0.921669 | 0.894902 | 0.922311 |
| MinMax | 0.884781 | 0.886566 | 0.908078 | 0.928987 | 0.857011 | 0.92296 | 0.876884 | 0.92011 |
| MaxAbs | 0.885776 | 0.885433 | 0.910891 | 0.929968 | 0.84692 | 0.922867 | 0.886749 | 0.921689 |
| RobustAbs | 0.880539 | 0.881865 | 0.90628 | 0.922837 | 0.850937 | 0.915458 | 0.879913 | 0.919876 |
| Yeo | 0.892846 | 0.895929 | 0.908275 | 0.933027 | 0.864237 | 0.923194 | 0.891065 | 0.923663 |
scores['AUC']
| LogReg | SVM | DecTree | RandFor | Bayes | KNN | MLP | XGB | |
|---|---|---|---|---|---|---|---|---|
| Standard | 0.885275 | 0.885886 | 0.908684 | 0.924901 | 0.859554 | 0.917718 | 0.889282 | 0.918904 |
| MinMax | 0.882722 | 0.884281 | 0.906875 | 0.927204 | 0.854836 | 0.920272 | 0.877157 | 0.918115 |
| MaxAbs | 0.883694 | 0.883381 | 0.909715 | 0.928671 | 0.845783 | 0.920345 | 0.88384 | 0.919729 |
| RobustAbs | 0.880662 | 0.881926 | 0.907032 | 0.922866 | 0.851735 | 0.9146 | 0.879627 | 0.919512 |
| Yeo | 0.890554 | 0.893364 | 0.907034 | 0.931451 | 0.854985 | 0.920769 | 0.886572 | 0.921402 |
scores['F1'].max()
LogReg     0.892846
SVM        0.895929
DecTree    0.911689
RandFor    0.933027
Bayes      0.864237
KNN        0.923194
MLP        0.894902
XGB        0.923663
dtype: object
scores['AUC'].max()
LogReg     0.890554
SVM        0.893364
DecTree    0.909715
RandFor    0.931451
Bayes      0.859554
KNN        0.920769
MLP        0.889282
XGB        0.921402
dtype: object
This may vary from run to run because of the randomness involved, but generally the Random Forest with MaxAbs scaling has the best performance in both AUC and F1. We will develop this model further, trying to get better performance with hyperparameter tuning.
RandFor = RandomForestClassifier() #initialize model
param_grid = { #hyperparameter dict
'bootstrap': [True], #bootstrap samples are used
'max_depth': [70, 90, 100, 150], #maximum depth of the tree
'max_features': [4, 5], #features to consider
'min_samples_leaf': [2, 5, 10], #number of samples required to be at a leaf node
'min_samples_split': [8, 10, 12], #number of samples required to split an internal node
'n_estimators': [50, 200, 500, 1000] #number of trees
}
#define our grid search
grid_search = GridSearchCV(
estimator = RandFor, #our model
param_grid = param_grid, #parameter to be tuned
cv = 5, #number of folds in cross-validation
n_jobs = -1, #using all processors to run jobs in parallel
    verbose = 3) #display the computation time and score for each fold and parameter candidate
method = scalers["MaxAbs"] #define method
df_es = scale_data(method, df_e) #run scaling func
X, y = resample_smote(df_es) #get balanced X and Y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33) #define train and test
grid_search.fit(X_train, y_train) #fit model
best_model = grid_search.best_estimator_ #get best model
print(best_model.score(X_test, y_test)) #get accuracy
print(roc_auc_score(y_test, best_model.predict(X_test))) #get AUC
print(f1_score(y_test, best_model.predict(X_test))) #get F1
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   16 tasks      | elapsed:   38.1s
[Parallel(n_jobs=-1)]: Done  112 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done  272 tasks      | elapsed: 11.7min
[Parallel(n_jobs=-1)]: Done  496 tasks      | elapsed: 20.2min
[Parallel(n_jobs=-1)]: Done  784 tasks      | elapsed: 32.5min
[Parallel(n_jobs=-1)]: Done 1136 tasks      | elapsed: 49.4min
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed: 65.6min finished
0.925716972261401
0.9257338728342779
0.9272336505987104
In this case, the hyperparameter tuning did not help much; there was actually an average reduction of around 30 bps across all the metrics. We would have to explore other scalings and hyperparameter grids to try to get a better model. Still, a 0.925 AUC is pretty good.
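One cheaper direction for that exploration (an alternative not used in this notebook) is `RandomizedSearchCV`, which samples parameter combinations from distributions instead of exhausting a 288-point grid, trading the 65-minute exhaustive search for a fixed budget of fits. A sketch on synthetic data, with an illustrative parameter space:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Ranges chosen for illustration, roughly spanning the grid used above
param_dist = {
    'max_depth': randint(10, 150),
    'min_samples_leaf': randint(1, 10),
    'min_samples_split': randint(2, 12),
    'n_estimators': randint(50, 200),
}

# n_iter caps the search at 10 sampled candidates (30 fits with cv=3)
search = RandomizedSearchCV(RandomForestClassifier(), param_dist,
                            n_iter=10, cv=3, n_jobs=-1, random_state=0)
search.fit(X, y)
print(search.best_params_)
```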